In the previous section, we hinted at the existence of statistical tests that can assess whether there is a difference in the average BMI estimates between two groups (e.g. national BMI values of women in 1985 and women in 2017). One such test is the Student’s \(t\)-test, which can be used to determine if two group means are different. However, as mentioned before, this test depends on certain assumptions about our data.
In this section, we are going to talk about this test and other tests that do not require the same assumptions about our data.
Hypothesis testing
Let’s remind ourselves of one of our original questions:
Is there a difference between rural and urban BMI estimates around the world?
In hypothesis testing, we are interested in comparing two different hypotheses: a “null” hypothesis (which can be thought of as a baseline, e.g. the means of two groups are the same) and an “alternative” hypothesis (e.g. the means of two groups are different). We then ask if there is enough evidence in our data to reject the null hypothesis.
Let’s try to formalize this a bit.
In our case, we want to perform what is called a two means hypothesis test using a statistical test (e.g. a Student’s \(t\)-test). Suppose we call the true unknown means of the two groups \(\mu_1\) and \(\mu_2\), for group 1 and group 2, respectively. Then we can define the null hypothesis that there is no difference in the two means:
\[ H_0: \mu_1 = \mu_2 \]
In contrast, we also define an alternative hypothesis that there is a difference between the two means:
\[ H_a: \mu_1 \neq \mu_2 \]
The idea behind a hypothesis test is that we assume the null hypothesis is true and we use our data to help us identify if there is enough evidence to reject the null hypothesis.
This is similar to the idea of assuming that individuals are not guilty until proven otherwise. If there is not enough evidence in the data, then we say we “fail to reject the null hypothesis”.
You might be asking, “But we don’t know what the true group means are!” That is correct. Instead, we must estimate these means based on the data that we have. For example, in our data, we can estimate the differences in the average BMI estimates for women in rural and urban regions between the two different years. Here we will use the summarize() function from the dplyr package again, this time using the argument na.rm = TRUE to remove NA values.
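A pipeline along the following lines produces summaries like the ones shown below. This is a sketch: the toy BMI_long tibble here is a stand-in with illustrative values, while the real data has one row per country, sex, year, and region.

```r
library(dplyr)

# Toy stand-in for the BMI_long tibble (illustrative values only)
BMI_long <- tibble(
  Sex    = "Women",
  Region = rep(c("Rural", "Urban"), each = 4),
  Year   = rep(c("1985", "1985", "2017", "2017"), times = 2),
  BMI    = c(23, 24, 26, NA, 24, 25, 27, 26.5)
)

avg_tbl <- BMI_long %>%
  filter(Sex == "Women") %>%
  group_by(Region, Year) %>%
  summarize(avg_bmi = mean(BMI, na.rm = TRUE)) %>%  # na.rm = TRUE drops NA values
  mutate(diff = avg_bmi - lag(avg_bmi))             # change from 1985 to 2017 within region

avg_tbl
```

After summarize(), the result is still grouped by Region, so lag() computes the 1985-to-2017 change within each region.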
# A tibble: 4 x 3
# Groups: Region [2]
Region Year avg_bmi
<chr> <chr> <dbl>
1 Rural 1985 23.6
2 Rural 2017 26.0
3 Urban 1985 24.6
4 Urban 2017 26.8
# A tibble: 4 x 4
# Groups: Region [2]
Region Year avg_bmi diff
<chr> <chr> <dbl> <dbl>
1 Rural 1985 23.6 NA
2 Rural 2017 26.0 2.39
3 Urban 1985 24.6 NA
4 Urban 2017 26.8 2.21
Here we see that average BMI estimates increased by 2.39 points within rural regions and by 2.21 points within urban regions.
Next, we can use a statistical test (e.g. a Student’s \(t\)-test) to assess whether these differences are statistically meaningful or whether they could arise due to random chance.
There are two possible classes of statistical tests that we could run to compare the means of these two groups:
- Parametric
- Nonparametric
Parametric tests are based on assumptions about the distribution of the underlying data, while nonparametric tests do not rely on these distributional assumptions.
See here for more information about the difference between these two classes of tests.
Parametric two-sample \(t\)-test
The two-sample \(t\)-test is a common way to determine if the means of two groups are different. The two-sample \(t\)-test, however, relies on several assumptions:
- The data for each group is normally distributed.
- The variance of both groups is similar.
- The observations are independent (meaning that observations do not influence each other).
If these assumptions are violated, this doesn’t necessarily mean we can’t perform a \(t\)-test. It just means we may need to consider the following options:
- Transformation of the data to make it more normally distributed.
- Welch’s \(t\)-test, also called the unequal variance \(t\)-test: a modified \(t\)-test that accounts for a difference in the variance of the two groups.
- Permutation/resampling methods to deal with violations of normality.
- If we have data that is not independent and is what we call paired, we should consider the paired \(t\)-test. We will explain this in more detail below.
We can also use a nonparametric test, which does not rely on assumptions about the normality of the data. These tests are often a good option when multiple assumptions are violated or when sample sizes are small. We will explore these options below.
It is also important to note that our data has a balanced number of observations across groups; in fact, the group sizes are equal. Balanced designs typically make tests comparing means more powerful.
If our design was not balanced, we might want to consider using permutation methods to improve power; these methods are also a good option when normality is violated. To learn more about these methods see here.
If we needed to check whether our samples were imbalanced, we could use the count() function from the dplyr package:
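A sketch of such a check, again with a toy stand-in for BMI_long (three “countries” per combination here, rather than the 200 in the real data):

```r
library(dplyr)

# Toy stand-in: 3 "countries" per Sex/Year/Region combination
BMI_long <- expand.grid(
  Country = c("A", "B", "C"),
  Sex     = c("Men", "Women"),
  Year    = c("1985", "2017"),
  Region  = c("National", "Rural", "Urban"),
  stringsAsFactors = FALSE
)

# count() tallies the number of rows for each combination of the given variables
counts <- count(BMI_long, Sex, Year, Region)
counts
```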
# A tibble: 12 x 4
Sex Year Region n
<chr> <chr> <chr> <int>
1 Men 1985 National 200
2 Men 1985 Rural 200
3 Men 1985 Urban 200
4 Men 2017 National 200
5 Men 2017 Rural 200
6 Men 2017 Urban 200
7 Women 1985 National 200
8 Women 1985 Rural 200
9 Women 1985 Urban 200
10 Women 2017 National 200
11 Women 2017 Rural 200
12 Women 2017 Urban 200
We can see that the number of observations for each possible group of interest is the same.
The \(t\)-test is also fairly robust to non-normality if the sample size is relatively large, due to what is called the central limit theorem, which states that as sample sizes get larger, the distribution of the sample mean approaches a normal distribution.
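A quick simulation illustrates this (a sketch using a skewed exponential distribution rather than the BMI data): although individual observations are far from normal, the means of repeated samples of size 200 pile up in a roughly normal, bell-shaped distribution.

```r
set.seed(123)

# 1000 sample means, each computed from a skewed exponential sample of size 200
sample_means <- replicate(1000, mean(rexp(n = 200, rate = 1)))

# The sampling distribution of the mean is roughly bell-shaped and
# centered near the true mean of 1, despite the skewed source distribution
hist(sample_means)
```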
We have an n of 200, which should be sufficient, but let’s investigate the nonparametric tests further.
Often we would check if the variance of the rural and urban data is equal using the var.test() function. However, this is an \(F\) test and assumes that the data is normally distributed. Instead we will use the mood.test() function, which performs Mood’s two-sample test for a difference in scale parameters and does not assume that the data is normally distributed. We will also introduce the pull() function from the dplyr package.
Mood two-sample test of scale
data: dplyr::pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and dplyr::pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI)
Z = 2.9189, p-value = 0.003513
alternative hypothesis: two.sided
# p-value < .05, so we reject the null hypothesis of no difference in spread
# and conclude that the variances are not equal
mood.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long,
Sex == "Women",
Year == "1985",
Region == "Urban"), BMI))
Mood two-sample test of scale
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and "Urban"), BMI)
Z = 3.1305, p-value = 0.001745
alternative hypothesis: two.sided
# p-value < .05, so we reject the null hypothesis of no difference in spread
# and conclude that the variances are not equal
mood.test(pull(filter(BMI_long,
Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Rural"), BMI))
Mood two-sample test of scale
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and "Rural"), BMI)
Z = -0.24228, p-value = 0.8086
alternative hypothesis: two.sided
# p-value > .05, so we fail to reject the null hypothesis of no difference
# in spread and conclude that the variances appear equal
mood.test(pull(filter(BMI_long,
Sex == "Women",
Year == "1985",
Region == "Urban"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Urban"), BMI))
Mood two-sample test of scale
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI) and "Urban"), BMI)
Z = 1.5317, p-value = 0.1256
alternative hypothesis: two.sided
Our p-value is less than .05 for both rural-versus-urban comparisons, thus we reject our null hypothesis that there is no difference in the spread for those comparisons. Therefore, we conclude that the variances are not equal and that our data violates this assumption as well.
We will perform a special t.test() where we account for the fact that our variances are not equal.
Another very important consideration is that our data is what we call paired, meaning that the measurements from the rural and urban areas are not independent. That is because we have a rural and an urban mean for nearly every country, and these values may be more similar to one another because they come from the same country. The same is true for the measurements of men and women from the same country, and for the values from the same countries in 1985 and later in 2017. However, we are assuming that measurements from different countries are independent, so the independence assumption holds across pairs, making it reasonable to perform the paired \(t\)-test.
When we perform a paired \(t\)-test, our hypothesis is slightly different from the typical Student’s \(t\)-test. In this case we are testing the differences between the pairs of observations and whether these differences are centered at zero. Our null hypothesis is that the mean of the differences is equal to zero:
\[ H_0: \mu_d = 0 \]
where \(\mu_d\) is the true mean of the differences between paired observations of the two groups.
The alternative hypothesis is that the mean of the differences is not equal to zero:
\[ H_a: \mu_d \neq 0 \]
t.test(pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Urban"), BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and "Urban"), BMI)
t = -10.356, df = 194, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.0573625 -0.7190478
sample estimates:
mean of the differences
-0.8882051
# p-value < .05: reject the null hypothesis of no difference in means
t.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Urban"), BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and "Urban"), BMI)
t = -14.095, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.1870263 -0.8956268
sample estimates:
mean of the differences
-1.041327
# p-value < .05: reject the null hypothesis of no difference in means
t.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Rural"), BMI),
var.equal = TRUE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and "Rural"), BMI)
t = -22.119, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.591762 -2.167422
sample estimates:
mean of the differences
-2.379592
# p-value < .05: reject the null hypothesis of no difference in means
t.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Urban"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Urban"), BMI),
var.equal = TRUE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI) and "Urban"), BMI)
t = -24.378, df = 198, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.383938 -2.027118
sample estimates:
mean of the differences
-2.205528
Question opportunity: Looking at the t value, was global BMI lower in Rural or Urban areas in 1985?
Now we will try transforming our data to make it more normally distributed. One way to do this is to take the logarithm of the data values. Then we will see how this influences the results. Again, we will focus on the data for women.
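A sketch of this transformation step (the tibble and column names BMI_long_log and log_BMI match those used in the code below; the toy values here are illustrative):

```r
library(dplyr)

# Toy stand-in for the BMI_long tibble (illustrative values)
BMI_long <- tibble(
  Sex    = "Women",
  Year   = "2017",
  Region = "Rural",
  BMI    = c(22.1, 24.3, 26.8, 30.2)
)

# Add a natural-log-transformed BMI column
BMI_long_log <- BMI_long %>%
  mutate(log_BMI = log(BMI))

BMI_long_log
```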


The data appears more similar to the normal distribution than it did before, although it still looks different from the theoretical normal distribution. However, our sample size of 200 is again quite large, and the \(t\)-test is generally quite robust to violations of normality with large \(n\); thus the modified \(t\)-test that accounts for unequal variance might be a good option using the log-transformed data, as it is at least more normally distributed.
Let’s see the results of the paired \(t\)-test with the transformed data:
t.test(pull(filter(BMI_long_log, Sex == "Women",
Year == "2017",
Region == "Rural"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women",
Year == "2017",
Region == "Urban"), log_BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == and pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Rural"), log_BMI) and "Urban"), log_BMI)
t = -10.058, df = 194, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.04242774 -0.02851589
sample estimates:
mean of the differences
-0.03547182
# p-value < .05: reject the null hypothesis of no difference in means
t.test(pull(filter(BMI_long_log, Sex == "Women",
Year == "1985",
Region == "Rural"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women",
Year == "1985",
Region == "Urban"), log_BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == "Rural"), log_BMI) and "Urban"), log_BMI)
t = -13.962, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.05214677 -0.03923811
sample estimates:
mean of the differences
-0.04569244
# p-value < .05: reject the null hypothesis of no difference in means
t.test(pull(filter(BMI_long_log, Sex == "Women",
Year == "1985",
Region == "Rural"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women",
Year == "2017",
Region == "Rural"), log_BMI),
var.equal = TRUE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Rural"), log_BMI) and "Rural"), log_BMI)
t = -22.369, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.10617051 -0.08896626
sample estimates:
mean of the differences
-0.09756839
# p-value < .05: reject the null hypothesis of no difference in means
t.test(pull(filter(BMI_long_log, Sex == "Women",
Year == "1985",
Region == "Urban"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women",
Year == "2017",
Region == "Urban"), log_BMI),
var.equal = TRUE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Urban"), log_BMI) and "Urban"), log_BMI)
t = -23.977, df = 198, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.09377834 -0.07952498
sample estimates:
mean of the differences
-0.08665166
We can see that our results are quite similar to those from the original data; however, the \(t\) values are slightly smaller in magnitude. In other cases, transforming the data may have a much more dramatic influence.
Now, let’s take a look at nonparametric tests, which are also a great option when the assumptions of the \(t\)-test are violated.
Nonparametric two-sample tests
There are multiple nonparametric options to consider when the assumptions of the \(t\)-test are violated. The Wilcoxon signed rank test (for paired data; the alternative for independent samples is the Wilcoxon rank sum test, also called the Mann-Whitney U test) and the two-sample Kolmogorov-Smirnov (KS) test both do not assume normality. These tests should be considered when the data in either group does not appear to be normally distributed, particularly when the number of samples is low.
Importantly, the KS test does not assume normality or equal variance, while the Wilcoxon signed rank test does assume equal variance. In our case, because the variance is not equal between some of our groups of interest, the KS test is more appropriate for those comparisons. Unlike the \(t\)-test, the KS test evaluates whether the two distributions as a whole are identical rather than testing a particular aspect of the distribution such as the mean; therefore there are no confidence intervals in the output of this test. Here is how you would perform these tests.
ks.test(pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Urban"), BMI),
paired = TRUE)
Two-sample Kolmogorov-Smirnov test
data: pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and "Urban"), BMI)
D = 0.20006, p-value = 0.0007385
alternative hypothesis: two-sided
ks.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Urban"), BMI),
paired = TRUE)
Two-sample Kolmogorov-Smirnov test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and "Urban"), BMI)
D = 0.19914, p-value = 0.0007779
alternative hypothesis: two-sided
What about the difference in female BMI from 1985 to 2017 for both regions? Recall that the variance was equal for these comparisons.
wilcox.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Rural"), BMI),
paired = TRUE)
Wilcoxon signed rank test with continuity correction
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and "Rural"), BMI)
V = 273, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
wilcox.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Urban"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Urban"), BMI),
paired = TRUE)
Wilcoxon signed rank test with continuity correction
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI) and "Urban"), BMI)
V = 189, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
There is a significant difference across time for both regions, as we saw with the \(t\)-test. There is also a significant difference by region for each year. However, the p-values are a bit larger for the KS test results than we saw with the \(t\)-test.
p-values
The p in p-value stands for probability: the probability of obtaining a test statistic (for example, the t in our Student’s \(t\)-tests, computed from the means of our comparison groups) at least as extreme as the one observed, assuming the null hypothesis is true. Therefore a p-value of 0.02 means that there is a 2 percent chance that our data would look the way it does due to random chance alone, even if there were really no difference in the means of the groups of interest.
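For instance, the first Mood's test above reported Z = 2.9189 with a two-sided p-value of 0.003513. Because that test statistic is compared against a standard normal distribution, we can recover the p-value directly:

```r
# Two-sided p-value for a standard normal test statistic of Z = 2.9189:
# twice the probability of seeing a value at least that extreme in one tail
z <- 2.9189
p <- 2 * pnorm(-abs(z))
p  # 0.003513
```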
So if alpha is the p-value threshold below which we call a result significant, then the probability of not making a false positive (type 1) error is 1 - alpha:
\[ P(\text{Not making an error}) = 1 - \alpha \]
[1] 0.95
OK, so if we use an alpha of .05, we are accepting that 95% of the time we will not make a type 1 error, and 5% of the time we will, just by random chance.
Taking this one step further:
\[ P(\text{Making an error}) = 1 - P(\text{Not making an error}) \]
\[ P(\text{Making an error}) = 1 - (1 - \alpha) \]
Here we can see that this checks out:
[1] 0.05
So what happens if we perform multiple tests?
The probability of not making a type 1 error remains the same for each test. Assuming the tests are independent, we multiply these probabilities together to determine the overall probability of avoiding an error across multiple tests. See here about why we multiply probabilities together.
\[ P(\text{Not making an error in } m \text{ tests}) = (1 - \alpha)^m \]
\[ P(\text{Making at least 1 error in } m \text{ tests}) = 1 - (1 - \alpha)^m \]
Let’s consider what happens if we perform 10 different statistical tests, using as usual the significance threshold alpha of .05.
The probability of getting at least 1 significant result with 10 tests is:
[1] 0.4012631
So there is a 40% chance that there will be a significant finding simply due to random chance alone.
What about 100 tests?
[1] 0.9940795
Yikes!! That is almost a 100% chance that there will be a significant finding simply due to chance alone!
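The probabilities above can be computed directly, reproducing the outputs shown:

```r
alpha <- 0.05

# Probability of not making a type 1 error in a single test
1 - alpha                # 0.95

# Probability of at least one type 1 error across 10 independent tests
1 - (1 - alpha)^10       # 0.4012631

# ... and across 100 independent tests
1 - (1 - alpha)^100      # 0.9940795
```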
Much of this explanation is described in this lecture.
One way to correct for this multiple testing issue is the Bonferroni method.
In this method we would divide our significance threshold (generally 0.05) by the number of tests.
[1] 0.0125
Our new significance threshold is now 0.0125. Thus our p-values should be less than this value for us to reject the null hypothesis that there is no difference in means. In all cases, our p-values were less than 0.0125, so we still see a significant difference in the means of the groups after multiple testing correction.
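The Bonferroni correction itself is a one-liner, and R's built-in p.adjust() function can equivalently scale up the p-values so they can still be compared against the original alpha. The p-values below are the ones reported by the Mood's tests earlier, reused here just for illustration:

```r
alpha <- 0.05
m <- 4        # number of tests performed

# Bonferroni-corrected significance threshold
alpha / m     # 0.0125

# Equivalent approach: multiply each p-value by m (capped at 1)
# and compare the adjusted values against the original alpha
p_values <- c(0.003513, 0.001745, 0.8086, 0.1256)
p.adjust(p_values, method = "bonferroni")
```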
Again, it would be reasonable to use the \(t\)-test because it is robust to deviations in normality when samples are relatively large. We can see that we obtained the same results regardless of the test that we used. However, if sample sizes are small (generally speaking n<15 for each group), then these nonparametric options are useful to know.
The D values in the output of our KS tests show the magnitude of the difference between the distributions of the groups tested. You may notice that the D value is larger for the tests of BMI across time than across region. In these tests, the p-value was also smaller.